1 Setup

Figures and table are saved in the ./figures/ and ./tables/ directories, respectively.

# Define folders structure and create folders if needed
figures_dir <- "./figures/"
tables_dir <- "./tables/"
dir.create(figures_dir,showWarnings = F)
dir.create(tables_dir,showWarnings = F)

1.1 Parameters

We set some parameters for rendering.

knitr::opts_chunk$set(
                      # Display PNG but also keep PDF (reverse the order for PDF outputs)
                      dev = c("png", "pdf"), 
                      # In general, keep all the figures produced in a chunk
                      fig.keep = "all",      
                      # Save all figures
                      fig.path = figures_dir, 
                      dpi = 72,
                      fig.width = 14, fig.height = 10,
                      # Cache chunk results and use cache if the chunk hasn't
                      # been changed since last rendering
                      cache = FALSE)

options(repr.plot.width = 14, repr.plot.height = 10)

1.2 Library loading

We need ggplot2 and patchwork for the plot as well as biomaRt for gene ids conversion.

library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
library(ggplot2)
library(patchwork)
library(biomaRt)

1.3 Color palettes

We set two colorblind-friendly palettes for the plots.

small_color_pal <- ggthemes::colorblind_pal()(8)
large_color_pal <- c("#F7F398", "#AA9A59", "#E63863", "#ff6db6", "#91D0BE", "#F3B1A0", "#57C3F3", "#AB3282", "#BD956A", 
                     "#8C549C", "#6778AE", "#F1BB72", "#53A85F", "#9FA3A8", "#712820", "#58A4C3", "#E0D4CA", "#E4C755",
                     "#8C9A69", "#F28D35", "#44A08D", "#E6A8D7", "#D9B0C4")

1.4 Data loading

We load the Seurat object containing processed scRNA-seq data.

sobj <- readRDS("seurat_subset_test.rds")

2 Data exploration

2.1 Seurat object characteristics

This Seurat object contains 5,000 cells within 4 assays

  • standard RNA expression assay

  • Spliced, spliced, unspliced and ambiguous transcript expression assays that have certainly been obtained using the velocyto tool. They can be used to investigate cell dynamics with a RNA velocity analysis.

sobj
## An object of class Seurat 
## 131299 features across 5000 samples within 4 assays 
## Active assay: RNA (33355 features, 2000 variable features)
##  3 layers present: counts, data, scale.data
##  3 other assays present: spliced, unspliced, ambiguous
##  2 dimensional reductions calculated: pca, umap

The RNA slot has been processed with a standard Seurat workflow to obtain a pca latent space from 2000 Highly variable genes which was used to cluster the cells and produce a UMAP visualization.

names(sobj@commands)
## [1] "NormalizeData.RNA"        "FindVariableFeatures.RNA"
## [3] "ScaleData.RNA"            "RunPCA.RNA"              
## [5] "FindNeighbors.RNA.pca"    "FindClusters"            
## [7] "RunUMAP.RNA.pca"

2.2 Cell type annotation

We can assume that the cell type annotation present in the object is derived from this analysis. This cell type annotation presents 2 clusters of interneurons, the cell type of interest.

col_cell_type <- large_color_pal
names(col_cell_type) <- levels(sobj$Manual_type)

df_cluster_type <- unique(sobj@meta.data[,c("Manual_type","seurat_clusters")])


col_clusters <- large_color_pal
names(col_clusters) <- df_cluster_type[order(df_cluster_type$Manual_type),"seurat_clusters"]

plot_manual_type <- DimPlot(sobj,group.by = c("Manual_type"),cols = col_cell_type) 
plot_clusters <- DimPlot(sobj,group.by = c("seurat_clusters"),cols = col_clusters) 
wrap_plots(plot_manual_type,plot_clusters,ncol = 1)

2.3 Other cell annotations

The metadata contains other cell annotations.

sobj@meta.data[] <- lapply(sobj@meta.data, function(x) if(is.character(x)) as.factor(x) else x)
summary(sobj@meta.data)
##  orig.ident   nCount_RNA      nFeature_RNA    sample_sampleId_short
##  hft:5000   Min.   :   642   Min.   : 501   hft_w16_p7_r2: 861     
##             1st Qu.:  3018   1st Qu.:1528   hft_w16_p7_r1: 766     
##             Median :  4530   Median :2040   hft_w24_p6_r2: 689     
##             Mean   :  5768   Mean   :2249   hft_w21_p5_r2: 593     
##             3rd Qu.:  6795   3rd Qu.:2732   hft_w20_p3_r2: 530     
##             Max.   :147544   Max.   :9889   hft_w21_p5_r1: 524     
##                                             (Other)      :1037     
##  sample_time  sample_cellLine sample_assay    sample_batch   percentRibo     
##  pcw16:1627   HFT3:1050       RNA NG:2833   b2019_06:1050   Min.   :0.01381  
##  pcw20:1050   HFT5:1117       RNA v3:2167   b2020_02:2323   1st Qu.:0.12283  
##  pcw21:1117   HFT6:1206                     b2020_03:1627   Median :0.15494  
##  pcw24:1206   HFT7:1627                                     Mean   :0.15547  
##                                                             3rd Qu.:0.18616  
##                                                             Max.   :0.41943  
##                                                                              
##  seurat_clusters                            Manual_type  
##  c0     : 993    Migrating glutamatergic neurons 1: 993  
##  c1     : 460    Interneurons 1                   : 460  
##  c2     : 409    Migrating glutamatergic neurons 2: 409  
##  c3     : 378    Interneurons 2                   : 378  
##  c5     : 332    Glutamatergic neurons 2          : 332  
##  c4     : 329    Glutamatergic neurons 1          : 329  
##  (Other):2099    (Other)                          :2099
barplot(summary(sobj@meta.data$sample_sampleId_short))

In summary we have 8 samples profiled at 4 different time points of the cortical development (week 16, 20, 21 and 24). There appear to be 2 replicates for each time point. The cells are also annotated for 3 different batches and two different assays (RNA v3, RNA NG).

table(sobj$sample_batch,sobj$sample_time)
##           
##            pcw16 pcw20 pcw21 pcw24
##   b2019_06     0  1050     0     0
##   b2020_02     0     0  1117  1206
##   b2020_03  1627     0     0     0
table(sobj$sample_cellLine,sobj$sample_time) #this two annotation are equivalent
##       
##        pcw16 pcw20 pcw21 pcw24
##   HFT3     0  1050     0     0
##   HFT5     0     0  1117     0
##   HFT6     0     0     0  1206
##   HFT7  1627     0     0     0
DimPlot(sobj,group.by = c("sample_sampleId_short","sample_time",
                          "sample_assay","sample_batch"),cols = small_color_pal)

The cells cluster by batch on the UMAP visualization questioning the reliability of the given cell annotation. The batch annotation partially overlaps with the assay and sample time annotations. We have here a highly complex experimental design with potential batch effect that will be highly difficult to distinguish from meaningful biological variations during the cortical development.

Further information about how this data were generated is needed to decide whether or not a batch correction should be implemented and at which level(s) (sample, batch, assay).

table(sobj$sample_batch,sobj$sample_time)
##           
##            pcw16 pcw20 pcw21 pcw24
##   b2019_06     0  1050     0     0
##   b2020_02     0     0  1117  1206
##   b2020_03  1627     0     0     0
table(sobj$sample_batch,sobj$sample_assay)
##           
##            RNA NG RNA v3
##   b2019_06      0   1050
##   b2020_02   1206   1117
##   b2020_03   1627      0

2.4 Quality control

As we only have the Seurat object, we don’t have the capture and sequencing reports that would allow us to have a complete view of the data quality. We can still perform basic quality control.

2.4.1 Cell quality

We create a new RNA assay with ENSG ids converted to HGNC gene names keeping only genes with at least one count.

rna_with_hgnc <- sobj[["RNA"]]$counts[rowSums(sobj[["RNA"]]$counts) > 0,]
                         
                         
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"),verbose = F)
ensg_to_hgnc <- getBM(filters= "ensembl_gene_id", 
                      attributes= c("ensembl_gene_id","hgnc_symbol"),
                      values=rownames(rna_with_hgnc),
                      mart= mart,
                      verbose = F)
## Batch submitting query [=====>-------------------------] 20% eta: 1mBatch
## submitting query [===========>-------------------] 40% eta: 1mBatch submitting
## query [==================>------------] 60% eta: 37sBatch submitting query
## [========================>------] 80% eta: 14s
duplicated_ensg <- ensg_to_hgnc$ensembl_gene_id[duplicated(ensg_to_hgnc$ensembl_gene_id)]
ensg_to_hgnc[ensg_to_hgnc$ensembl_gene_id == duplicated_ensg,] 
# when more than one gene name is found we keep only the first one
ensg_to_hgnc <- ensg_to_hgnc[!duplicated(ensg_to_hgnc),]

rna_with_hgnc <- rna_with_hgnc[ensg_to_hgnc$ensembl_gene_id,]
ensg_to_hgnc$gene_name <- ensg_to_hgnc$hgnc_symbol
# we keep ensg id when no gene name is available
ensg_to_hgnc$gene_name[ensg_to_hgnc$gene_name == ""] <- ensg_to_hgnc$ensembl_gene_id[ensg_to_hgnc$gene_name == ""]
rownames(rna_with_hgnc) <- make.unique(ensg_to_hgnc$gene_name)

sobj[["RNA_HGNC"]] <- CreateAssay5Object(counts = rna_with_hgnc)
DefaultAssay(sobj) <- "RNA_HGNC"
sobj <- NormalizeData(sobj)
## Normalizing layer: counts
Idents(sobj) <- "orig.ident"

# We have a percentRibo stat computed on the whole dataset (not subsetted)
# We can recompute this stat for the subssetted dataset
# We can also add the stat for the percentage of mitochondrial transcripts
sobj[["percent.mt"]] <- PercentageFeatureSet(sobj, pattern = "^MT-")
sobj[["percent.ribo"]] <- PercentageFeatureSet(sobj, pattern = "^RPS|^RPL")


VlnPlot(sobj, features = c("nFeature_RNA", "nCount_RNA","percent.mt","percent.ribo"), ncol = 2)

We have many cells with high percentages (> 15 %) of counts originating from mitochondrial (or ribosomal) transcripts. These cells are potentially of bad quality and should be filtered.

min(sobj$nFeature_RNA)
## [1] 501
max(sobj$nFeature_RNA)
## [1] 9889
min(sobj$nCount_RNA)
## [1] 642
max(sobj$nCount_RNA)
## [1] 147544

Furthermore, some cells with a high number of detected genes could be doublets. A cutoff of a minimum of 501 detected genes appears to have been applied to this dataset.

Further clarification is needed regarding how genes and cells were filtered, if at all. These preliminary results suggest that many low-quality cells may still be present. We can consider applying commonly used thresholds for the metrics mentioned above to retain only high-quality cells.

Additional advanced quality controls, such as automated doublet detection, could also be performed to further exclude dubious cells.

2.4.2 Cell Cycle

Using signatures provided by Seurat, we can assign each cell to a specific cell cycle phase (G2M/S or G1/G0) to assess potential confounding effects of the cell cycle on the processed data.

s_genes <- cc.genes$s.genes
g2m_genes <- cc.genes$g2m.genes
sobj <- CellCycleScoring(sobj, s.features = s_genes, g2m.features = g2m_genes, set.ident = TRUE)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms
p1 <- DimPlot(sobj,group.by = "Phase",cols = small_color_pal[2:4]) 
p2 <- DimPlot(sobj,group.by = c("Manual_type"),cols = col_cell_type,label = T,repel = T) 

wrap_plots(p1,p2,ncol = 1)

Radial glia and progenitor cells cluster by cell cycle phase in this processed dataset, suggesting that the cell cycle is a major driver of gene expression in these cell types, potentially obscuring finer biological differences.

This issue can be addressed using various approaches, with cell cycle regression being the most common.

3 Data analysis : Molecular Characterization of Interneurons

Given the primary focus on interneurons, I propose starting with an initial analysis that relies on the provided annotations to identify markers for the different annotated cell types, particularly for the two interneuron clusters.

3.1 Marker analysis

## RNA_HGNC assay has been normalized before
Idents(sobj) <- "Manual_type"
all_markers <- FindAllMarkers(sobj, only.pos = T,logfc.threshold = 0.25,min.pct = 0.1,verbose = F)
all_markers <- all_markers[all_markers$p_val_adj <0.05,]
write.csv(all_markers,file = "./tables/all_markers_original_anno.csv",quote = F,row.names = F)

all_markers[all_markers$cluster == "Interneurons 1",]
all_markers[all_markers$cluster == "Interneurons 2",]

We can plot known interneuron markers from the literature, identified through this analysis, such as DLX genes, CXCR4, or ERBB4.

common_interneuron_markers <- intersect(all_markers[all_markers$cluster == "Interneurons 2","gene"],
          all_markers[all_markers$cluster == "Interneurons 1","gene"])
common_interneuron_markers
##   [1] "ERBB4"     "DLX6-AS1"  "PDZRN4"    "DLX2"      "CXCR4"     "GAD1"     
##   [7] "SP9"       "PLS3"      "DLX5"      "ARX"       "NRXN3"     "GAD2"     
##  [13] "PDE4DIP"   "C11orf96"  "RIPOR2"    "DLX1"      "RBP1"      "SOX1"     
##  [19] "DLX6"      "ARL4D"     "DCX"       "CEMIP2"    "NPAS1"     "SLAIN1"   
##  [25] "ZNF536"    "INA"       "SLC32A1"   "BTG1"      "TMEM123"   "MAFB"     
##  [31] "CADPS"     "PFN2"      "FGD3"      "ARL4C"     "MOB3B"     "ST8SIA5"  
##  [37] "ZEB2"      "TCF4"      "CELF4"     "SLC6A1"    "LINC01305" "BCL11B"   
##  [43] "QKI"       "SOX4"      "NNAT"      "VSTM2A"    "MEG3"      "NPAS3"    
##  [49] "PNOC"      "NREP"      "MTSS1"     "DPYSL3"    "GPD1"      "KIF2A"    
##  [55] "EIF1"      "SHTN1"     "RARA"      "MYO1B"     "MYCBP2"    "DSCAML1"  
##  [61] "TMSB10"    "FTH1"      "TLE5"      "YBX1"      "H3-3B"     "JAKMIP2"  
##  [67] "FEZ1"      "DCLK2"     "CCDC88A"   "NCAM1"     "WLS"       "CRMP1"    
##  [73] "PDZRN3"    "CITED2"    "CHL1"      "STMN2"     "ANKRD12"   "HSPA1A"   
##  [79] "EIF4G2"    "NTN4"      "AKAP9"     "ARRDC3"    "CHD9"      "MLLT11"   
##  [85] "PNRC1"     "SLCO5A1"   "PAIP2"     "DDX6"      "DCLK1"     "DST"      
##  [91] "DAPK1"     "SOX2"      "RPS29"     "EIF5"      "CD24"      "CADM1"    
##  [97] "ZNF91"     "TXN"       "ATP1B1"    "ZBTB16"    "KLF7"      "NSG1"     
## [103] "WNT7A"     "GAS2L3"    "RERE"      "DNAJB6"    "ZNF428"    "TSC22D3"  
## [109] "CRACD"     "ZNF704"    "CCSER1"
FeaturePlot(sobj, features = c("DLX1","DLX2","CXCR4","ERBB4"),reduction = "umap")

3.2 Differentially Expressed Genes (DEG) between interneuron clusters 1 and 2

We can already observe expression differences for some genes between the two interneuron clusters. Let’s identify all differentially expressed genes between interneuron clusters 1 and 2.

DEG_interneurons_1_vs_2 <- FindMarkers(sobj, ident.1 = "Interneurons 1",ident.2 = "Interneurons 2",
                                       logfc.threshold = 0.25,min.pct = 0.1) 
DEG_interneurons_1_vs_2 <- DEG_interneurons_1_vs_2[DEG_interneurons_1_vs_2$p_val_adj < 0.05,]
DEG_interneurons_1_vs_2$cluster <- "Up in Interneurons 1"
DEG_interneurons_1_vs_2$cluster[sign(DEG_interneurons_1_vs_2$avg_log2FC) < 0] <- "Up in Interneurons 2"
write.csv(DEG_interneurons_1_vs_2,
          file = "./tables/deg_interneurons_1_vs_2_original_anno.csv",quote = F)

DEG_interneurons_1_vs_2

We can plot the top three significantly highly expressed genes for each interneuron type

top_interneurons_1 <- rownames(DEG_interneurons_1_vs_2[DEG_interneurons_1_vs_2$cluster ==  "Up in Interneurons 1",])[1:3]
top_interneurons_2 <- rownames(DEG_interneurons_1_vs_2[DEG_interneurons_1_vs_2$cluster ==  "Up in Interneurons 2",])[1:3]

VlnPlot(sobj[,sobj$Manual_type %in% c("Interneurons 1","Interneurons 2")],features = c(top_interneurons_1,top_interneurons_2),cols = col_cell_type,stack = T,fill.by = "ident",flip = T)

The two interneuron clusters exhibit clear differences in gene expression. Further DEG analysis could be conducted to investigate changes in gene expression between these two clusters at different time points during cortical development. This will require careful consideration of the complex experimental design, for example, by using EdgeR with pseudobulks of the different samples.

RNA velocity analysis, combined with pseudotime analysis, could provide valuable insights into common or distinct progenitors for these two types of interneurons during cortical development. However, this will be a highly challenging analysis due to the experimental design and the batch effects observed in the processed data.

4 Conclusion

Overall, this processed data presents a complex design with confounding factors, including batch effects at different levels and cell cycle. No corrections appear to have been applied during the processing of this data to obtain the cell type annotations.

If the two clusters annotated as interneurons exhibit known markers, I would like to discuss whether a complete reanalysis of this data—ideally starting from the raw FASTQ files (which were not provided)—should be performed, with potential corrections for confounding factors. This would allow for more reliable and potentially finer annotations of the interneuron subtypes. Such a reanalysis could serve as the basis for the more advanced analyses discussed above.

4.0.1 Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux trixie/sid
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.28.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biomaRt_2.62.1     patchwork_1.3.0    ggplot2_3.5.1      Seurat_5.2.1      
## [5] SeuratObject_5.0.2 sp_2.2-0          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      jsonlite_1.8.9          magrittr_2.0.3         
##   [4] spatstat.utils_3.1-2    farver_2.1.2            rmarkdown_2.29         
##   [7] zlibbioc_1.52.0         vctrs_0.6.5             ROCR_1.0-11            
##  [10] memoise_2.0.1           spatstat.explore_3.3-4  progress_1.2.3         
##  [13] htmltools_0.5.8.1       curl_6.2.0              sass_0.4.9             
##  [16] sctransform_0.4.1       parallelly_1.42.0       KernSmooth_2.23-26     
##  [19] bslib_0.9.0             htmlwidgets_1.6.4       ica_1.0-3              
##  [22] httr2_1.1.0             plyr_1.8.9              plotly_4.10.4          
##  [25] zoo_1.8-12              cachem_1.1.0            igraph_2.1.4           
##  [28] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [31] Matrix_1.7-2            R6_2.5.1                fastmap_1.2.0          
##  [34] GenomeInfoDbData_1.2.13 fitdistrplus_1.2-2      future_1.34.0          
##  [37] shiny_1.10.0            digest_0.6.37           colorspace_2.1-1       
##  [40] AnnotationDbi_1.68.0    S4Vectors_0.44.0        tensor_1.5             
##  [43] RSpectra_0.16-2         irlba_2.3.5.1           RSQLite_2.3.9          
##  [46] labeling_0.4.3          filelock_1.0.3          progressr_0.15.1       
##  [49] spatstat.sparse_3.1-0   httr_1.4.7              polyclip_1.10-7        
##  [52] abind_1.4-8             compiler_4.4.2          bit64_4.6.0-1          
##  [55] withr_3.0.2             DBI_1.2.3               fastDummies_1.7.5      
##  [58] MASS_7.3-64             rappdirs_0.3.3          tools_4.4.2            
##  [61] lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.3    
##  [64] goftest_1.2-3           glue_1.8.0              nlme_3.1-167           
##  [67] promises_1.3.2          grid_4.4.2              Rtsne_0.17             
##  [70] cluster_2.1.8           reshape2_1.4.4          generics_0.1.3         
##  [73] gtable_0.3.6            spatstat.data_3.1-4     tidyr_1.3.1            
##  [76] hms_1.1.3               data.table_1.16.4       xml2_1.3.6             
##  [79] XVector_0.46.0          BiocGenerics_0.52.0     spatstat.geom_3.3-5    
##  [82] RcppAnnoy_0.0.22        ggrepel_0.9.6           RANN_2.6.2             
##  [85] pillar_1.10.1           stringr_1.5.1           spam_2.11-1            
##  [88] RcppHNSW_0.6.0          later_1.4.1             splines_4.4.2          
##  [91] dplyr_1.1.4             BiocFileCache_2.14.0    lattice_0.22-6         
##  [94] survival_3.8-3          bit_4.5.0.1             deldir_2.0-4           
##  [97] tidyselect_1.2.1        Biostrings_2.74.1       miniUI_0.1.1.1         
## [100] pbapply_1.7-2           knitr_1.49              gridExtra_2.3          
## [103] IRanges_2.40.1          scattermore_1.2         stats4_4.4.2           
## [106] xfun_0.50               Biobase_2.66.0          matrixStats_1.5.0      
## [109] UCSC.utils_1.2.0        stringi_1.8.4           lazyeval_0.2.2         
## [112] yaml_2.3.10             evaluate_1.0.3          codetools_0.2-20       
## [115] tibble_3.2.1            cli_3.6.3               uwot_0.2.2             
## [118] xtable_1.8-4            reticulate_1.40.0       munsell_0.5.1          
## [121] jquerylib_0.1.4         GenomeInfoDb_1.42.3     Rcpp_1.0.14            
## [124] globals_0.16.3          spatstat.random_3.3-2   dbplyr_2.5.0           
## [127] png_0.1-8               spatstat.univar_3.1-1   parallel_4.4.2         
## [130] blob_1.2.4              presto_1.0.0            prettyunits_1.2.0      
## [133] dotCall64_1.2           listenv_0.9.1           ggthemes_5.1.0         
## [136] viridisLite_0.4.2       scales_1.3.0            ggridges_0.5.6         
## [139] crayon_1.5.3            purrr_1.0.2             rlang_1.1.5            
## [142] KEGGREST_1.46.0         cowplot_1.1.3